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Abstract 

In this paper, we propose a parallel space-time domain decomposition method for 
solving an unsteady source identification problem governed by the linear convection- 
diffusion equation. Traditional approaches require to solve repeatedly a forward 
parabolic system, an adjoint system and a system with respect to the unknowns. 

The three systems have to be solved one after another. These sequential steps are not 
desirable for large scale parallel computing. A space-time restrictive additive Schwarz 
method is proposed for a fully implicit space-time coupled discretization scheme to 
recover the time-dependent pollutant source intensity functions. We show with numer¬ 
ical experiments that the scheme works well with noise in the observation data. More 
importantly it is demonstrated that the parallel space-time Schwarz preconditioner is 
scalable on a supercomputer with over 10^ processors, thus promising for large scale 
applications. 
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1 Introduction 


Pollutant source inversion problems have wide applications in, for example, the detection 
and monitoring of indoor and outdoor air pollution, underground water pollution, etc. In 
the last several decades, physical, chemical and biological technologies have been developed 
to identify different types of sources [3ll35lll6]. In this paper, assuming the pollutant con¬ 
centration data is measured by distributed sensors, we reconstruct the source intensities 
numerically using noise-contaminated data. Like all inverse problems, such a reconstruc¬ 
tion problem is ill-posed in the sense of Hadamard [HlMlIlI]. The lack of stability with 
respect to the measurement data is a major issue, which means that small noise in the 
data may lead to significant changes in the reconstructed source strength. This problem 
has attracted much attention, and various methods have been developed, including both 
deterministic and statistical methods [SUEZ]. Among the deterministic methods, quasi¬ 
explicit reconstruction formulas are available for one-dimensional source location recovery 
problems [ElEo]; and quasi-reversibility methods can be used to retrace the pollutant his¬ 
tory as in [36] ; optimization based methods are also widely used [2l[22l[23l[35lll01S2|. By 
reformulating an inverse problem into an output least-squares PDE-constrained optimiza¬ 
tion problem complemented with Tikhonov regularization, classical optimization methods 
such as regression methods linear and nonlinear programming methods m, linear 
and nonlinear conjugate gradient methods ID SO], Newton type methods, etc. can be 
used to obtain the approximate solutions. These methods can be categorized as sequen¬ 
tial quadratic programming (SQP) methods. Reduced space SQP methods decouple the 
system and iteratively update the state variable, the adjoint variable and the optimization 
variables by solving each subsystem in a sequential order. In some sense this is a block 
Gauss-Seidel iteration with three large blocks. Such methods require less memory due to 
the reduced subproblem size but the number of outer iterations for a specified accuracy 
grows quickly with the increase of the optimization variables, thus they are not ideal for 
supercomputers with a large number of processors. We introduce in this paper a full 
space approach that does not have the three large sequential steps as in the reduced space 
approaches. Similar approaches have been applied to flow control problems in [31]. The 
full space method solves the state variable, adjoint variable and the optimization variables 
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simultaneously, thus avoids repeatedly solving the subsystems. However the fully coupled 
system is several times larger in size and more ill-conditioned, direct methods such as 
Gaussian elimination or LU factorization as well as the classical iterative methods such 
as the Jacobi method, the Gauss-Seidel method are not suitable. To ease the difficulty 
of solving the large system, a preconditioned Krylov subspace technique is considered to 
reduce the condition number and the computing time significantly [9l [Uj . 

The inverse problem of recovering the pollutant source intensity functions can be refor¬ 
mulated into a PDE-constrained optimization problem. In this paper, we derive its con¬ 
tinuous Karush-Kuhn-Tucker (KKT) system [2l], including the state equation, the adjoint 
equation and other derivative equations with respect to each unknown source intensity. 
Two main challenges of the problem lie in that firstly the adjoint equation needs the final 
state of the pollutant source distribution, which implies that the state equation and the 
adjoint equation should be solved in a sequential order; secondly the time marching of the 
unsteady problem is directional and sequential, thus difficult to break down into parallel 
steps. For unsteady PDE-constrained optimization problems, a steady state optimization 
subproblem is solved at each time step [Hj. And in [T5], a block time-marching method 
is used to reduce the number of sequential steps and increase the degree of parallelism. 
In this paper, we propose a fully coupled space-time domain decomposition method that 
couples the time with the space domain and decomposes the “space-time” domain into 
sub-domains, then apply an additive Schwarz preconditioned Krylov subspace technique 
to solve the “space-time” problem. Our algorithm is fully parallel in space and time, 
avoids the sequential time marching steps, and does not need to solve optimization sub¬ 
problems. As far as we know, no published work has achieved such a degree of parallelism 
for time-dependent inverse problems. 

The rest of this paper is arranged as follows. The mathematical model and its corre¬ 
sponding optimization functional, and the derivation of the KKT system are formulated in 
Section [2l The discretization of the KKT system is given in Section [3l The parallel algo¬ 
rithm for solving the KKT system is proposed in Section 01 Some numerical experiments 
are shown in Section [5] and concluding remarks are given in Section [6l 
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2 Model formulation 


We consider a flow domain O € in which several point pollutant sources are present. 
The distribution of the pollutant concentration is denoted by C{x,t) at location x and 
time t. The transport process is modeled by the following convection-diffusion equation 

laisii: 

dC ^ 

— = V • (a(x)VC) - V • (v(x)C) + 5(x - K*)m, 0 < t < T, x G (1) 

1=1 

where fi{t) is the temporal intensity of the f**' source at location x*, z = 1, • • • , s, s is 
the number of sources, a(x) and v(x) are the diffusive and convective coefficient. (5(-) is 
the Dirac delta distribution [5]. The model is complemented by the following boundary 
conditions 

dC 

C(x,t) = p(x,t), xGTi; a(x)—= g(x,t), x € r 2 (2) 

and the initial condition 

(^(x, 0) = Co(x), X G D, (3) 

where Ti and r2 cover the physical boundary dVL = ri|jr2, p(x, t) and q{x,t) are given 
functions for Dirichlet and Neumann boundary condition respectively. If the source loca¬ 
tions x* (i = 1, ■■ ■ ,s) and the corresponding time-dependent intensities /j(t) (z = 1, • • • ,s) 
in ([ll) are all known, then the distribution of the pollutant concentration C{x,t) can be 
obtained by solving the convection-diffusion equation m-m- This is usually called a 
forward or direct problem. In this paper, we are concerned about the inverse problem, 
that is, using the noise-contaminated data C^{x,t) [e is the noise level) of the concen¬ 
tration C{x,t) in D at terminal time T to recover the source intensity functions fi{t) 
(z = 1, • • • , s). In practice, the data C^{x,t) is measured by a sensor network placed at 
some discrete points inside the domain D [261130j . A discussion of the sensor network can 
be found in [26] , but we shall assume that the measurement data is available here at a set 
of uniformly distributed sensors inside D. 

The Tikhonov optimization algorithm is popular for time-dependent parameter iden¬ 
tification problems |22l [23] . The main ingredient of the algorithm includes reformulating 
the reconstruction process as the minimization of the following functional: 

m = l [ {C{x,T)-C%x,T) fdx + N/s{f), ( 4 ) 

J 
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where f = (/i,/ 2 ,-‘‘ ifsY'i and iVa(f) denotes some Tikhonov regularization. Possible 
choices for the regularizations include and BV regularizations. Here we consider 

a combination of the and regularizations in the following form 


"-’(f)=E f +E 

i=l i=l 
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(5) 


where /3|,/32, i = I,-’’ are the or regularization parameters for the source 
intensity /i(t), • • • , fs{t) respectively. The minimization of the functional dH is subject to 
the constraints that C{x, t) satisfies the state equation ([1]) with the boundary conditions ([2]) 
and the initial condition ([3]) . This has transformed the original inverse source problem into 
a PDE-constrained optimization problem. Two kinds of approaches for the optimization 
problem (j3|) are available, the discretize-then-optimize approach and the optimize-then- 
discretize approach. The solutions from both approaches are credible, although they are 
not necessarily the same m- We shall use the optimize-then-discretize approach in this 
work. 

Let and be standard Sobolev spaces with p, g > 0 such that 1/p + 

1/q = 1 and p > 2, q < 2. We formally write ([I]) as an operator equation L{C, f) = 0, then 
introduce a corresponding Lagrange multiplier G G and the following Lagrange 

functional [2l|22l[23]: 


J{C, f, G) = i / (C(x, T) - C=(x, T)fdx + iVp(f) + (G, L(C, f)), (6) 

J 

where G is the Lagrange multiplier or the adjoint variable, and (G, L{G, f)) stands for the 
dual product. 

Taking the variations of ([6]) with respect to G, G and fi, i = 1, ■ ■ ■ ,s, a system of partial 
differential equations is derived to characterize the first-order optimality conditions for this 
optimization problem Q. They are the so-called Karush-Kuhn-Tucker (KKT) optimality 
conditions [23]. It has been verified that the minimization problem (|3|) is equivalent to 
solving the KKT system [24| of the Lagrangian functional J7"(G, f, G) in [TT]. The three 
sets of equations in the KKT system are obtained as follows: 
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(a) The Gateaux derivative of J with respect to G at direction v is given by 


JG{C,^,G)v = {v,L{C,^)) 

= +(aVC,Va) + (V-(vC),u) 

S 

- {q,v)r2 

i=l 


for all V € 


(b) The Goteaux derivative of J" in ([6]) with respect to G at direction w € is 

given by 


JciG,¥,G)w= [ (C(x,r)-C"(x,r))u;dx 
Jn 

J G ^ ■ (o(x)Vt(;) + V • (v(x)rt;) ) dxdt. 


+ 


For convenience, we write 


(7) 


Lw ;= —— V • {a{x.)'Vw) + V ■ (v(x)r(;, 
and obtain by integrating by part for the second term of ([7]) that 

{G,Lw) = j j G —V ■ {a{x)Vw)+ V ■ {v{x)w)^ dxdt 

= [ Gtcl^dx— [ [ w-^—dxdt— [ [ a(x)G-^dTdt 

Jn Jo Jn dt Jo Jan ^ ^ dn 

+ / a{x)Vw ■ VGdxdt + / / V ■ {v{x)w)Gdxdt 

Jo Jn Jo Jn 

= [ Gw\odx+ [ [ (—a(x)^J^G] dVdt 

Jn Jo Jdn V on J 

L + a{x)Vw ■ VG + V • (v(x)r(;)G^ dxdt. 

Then applying the boundary and initial conditions of w, i.e. u; = 0 on Ti and 


a(x)|^ = 0 on r 2 , u)(x,0) = 0, we derive 


iG,Lw) = 


+ 



G(x, T)w{x, T)dx + 



dVdt 



—w 


8G \ 

— + a(x)Vw • VG + V • (v(x)u;)G 
at J 


dxdt. 
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Now noting the arbitrariness of w, we can deduce the adjoint system for the Lagrange 
multiplier G, namely G{x,T) = 0 for x € n, G(x, t) = 0 on Fi and G(x,t) satisfies 


- {Gt, w) + {aVG, Vw) + (V • (vu;), G) = -{6{t - T){G{;t) - G%;t)),w) (8) 

for all w G such that tc = 0 on Fi, where d{t — T) is the Dirac delta 

distribution at t = T. 


(c) The Gateaux derivative of J in Q with respect to fi at direction g G is 

given by 

Jo Jo 

- [ {Gix,t),5{yL-x*)g{t))dt (9) 

Jo 

= [ iPihit) - G{x*,t))g{t)dt +PI [ f'{t)g'{t)dt. 

Jo Jo 


Putting (a)-(c) together, the KKT system is formulated as follows: 

/ 

JG(C,f,G)u = 0 

< Jc(C,f,G)u; = 0 (10) 

J/,(C',f,G)5 = 0, z = 

\ 

that is, for any v G and w G VF^’'^(D), we have the following coupled system: 

+ (aVC, Vu) + (V • (vC),?;) - ^u(x*)/i(t) - (g,u)r 2 = 0 

^ ^ i=l 

< ~ (^’ ^) ^ ( 11 ) 
+{5{t-T)iGi;t)-Gh;t),w)) = 0 

-iGi^*, ■),g) + Piih,9) + Plifh 5') = 0, f = 1, • • • , 5 

with C(x, 0) = C'o(x), G{x,T) = 0. The rest of the paper is devoted to solving (fTTD as a 
coupled space-time system. It is noted that the first equation in ([TT]) is the state equation, 
and the second equation is the adjoint equation, and the last set of equations are elliptic 
equations with respect to each unknown source intensity. 
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3 Finite element discretization 


Let be a triangulation of with triangular elements, then we define as the finite 
element space [12] consisting of continuous piecewise linear functions on and the 
subspace of with functions vanishing on the Dirichlet boundary Li. To fully discretize 
the system m, we partition the time interval [0,T] as 0 = < ■ ■ ■ < = T, with 

t"- = nr, r = T/M. Define as a piecewise linear continuous finite element space in 
time. For a given sequence {Lf"’(x) = Lf(x, t”)}, we define the difference quotient and the 
averaging function respectively by 

( 12 ) 

r 2 

Let TTfi be the finite element interpolation associated with the space V^, and be the 

finite element approximation of then we discretize the state and adjoint equa¬ 

tions of the system (llip by the Crank-Nicolson scheme in time and piecewise linear finite 
elements in space, and lastly we use piecewise linear finite element in time to discretize 
the equations with respect to f. The finite element approximation of the KKT system 
m can be formulated as follows: 

Find a sequence of approximations CJ^, G (0 < n < M), f[ G U'^,i = 1, • • • ,s, 
such that = tthCo, Gff = 0 and C'^{x) = 7rhp{x, G), GjJ(x) = 0 for x G Fi satisfying 

{drGJi, Vh) + {aVCJi, Vvh) + (V • {^rCJi),Vh) 

= VufeGlL^, n = l,--- ,M 

< -{drG^^, Wh) + {aVGl, Vwh) + (V • G^) 

= -XnaGJi-Gn,Wh), \/wheV\ n = M,--- ,1 

-{G-{x*,-),g-) + = 0, i = ,s, n = 0, • • • ,M, 

(13) 

where Xn = ^ when n = M and 0 when 1 < n < M. We denote the basis functions of 
finite element spaces and 17’’ by </>*, i = 1, • • • ,N and g"^, n = 0, • • • , M, respectively. 





and introduce the following matrices: 


B = 1 

B — ,Nj 

B — {knm)n,m=0,---,M 1 
B — {dnm)n,m=Q,---,M 1 


(Xij — (uVV (pj ) 


bij — {(j)i,(j)j) 

eij = (V • {^r(t>i),4>j) 

knm = {{gn'Ar)') 

dnm = {g^,9n 


and the vectors 


C^ = {C^,C^,--- ,C^f, for n = 0,---,M 

G^ = iG^„G^,--- ,G%f, forn = 0,---,M 

fk = (/°, fl,-- - , fkf, for A: = 1, • • • , s 
r- = (r-,r- for j = 1, ■ ■ • , iV, n = 1, • • • , M with 


= -T 




^fe=l 

^0 


) Y] 


^2, 


gk = (gk^gir-- ,g^f, with = G(x^,r), for fc = l,--- ,s, n = 0,---,M 


d = {di,d 2 , - ■ ■ , d]\f), with dj = (C^, cpj) for j = 1, - ■ ■ , N. 


The matrix form of the KKT system is then reformulated by using the above notations as 
the following: 


(^B + ^{A + E)'^G^+(^-B + ^{A + E)^G^-^+r^ = 0, n = l,-- - ,M 

+ ^(A + E^)^ G^+(^B + ^-{A + E^)) G^-^ 


(14) 


+TXn{BG'^ - d) = 0, n = M, • • • , 1 

-Dgl + {p\D + p^K)fk = Q, A: = l,--.,s. 


We can follow the approaches in [2T1 [22l [23l 02] to obtain the convergence of the discretized 
problem (ll.Sp to the continuous optimization problem Q. 
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4 A space-time domain decomposition method for the KKT 


system 


4.1 Fully coupled KKT system with special ordering of unknowns 

The ordering of the unknowns for the discretized KKT system (jl3p has significant influence 
in the convergence and computing efficiency of the iterative solver. Traditional reduced 
space SQP methods split the system into three subsystems and solve each subsystem for 
C, G, and f one by one in sequential steps m, in this case the unknowns are ordered 
physical variable by physical variable. To develop a scalable and fully coupled method 
for solving the KKT system, we use the so-called fully coupled ordering, the unknowns C 
and G are ordered mesh point by mesh point and time step by time step. At each mesh 
point Xj, j = 1, • • • and time step f”, n = 0, ■ ■ ■ , M, the unknowns are ordered in the 
order of G^,G'j,j = I,-- - ,N,n = 0, ■ ■ ■ ,M. Such ordering contains unknowns of the 
same space-time subdomain in a subblock, preconditioners such as additive Schwarz can 
be applied naturally to each subblock of the fully coupled KKT system and the ordering 
also improves the cache performance of the LU factorization based solvers. Since f is 
defined only in the time dimension, we put all the unknowns of f at the end after G and 
G. More precisely, we define the solution vector U by 


TT _ /'/^O /^1 /^1 

U — • • • , Otv, • • • , 




1^1 ? 




M /^M 
N 5 


/■O . . . fC 

? J 1 ? •> J S 


fM £ 

Jl J I 




then the linear system (fT3|) with unknowns G^ and j = - ,iV, n = 0,--' ,M, and 

/^, fe = 1, • • ■ , s, n = 0, • • • , M, is reformulated into the following linear system: 

FC/ = 6, (15) 
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where F is a sparse matrix of size (M + 1)(2A^ + s) x (M + l)(2iV + s) derived from the 
finite element discretization for KKT system (I14p with the following block structure: 


Soo 

Soi 

0 


0 

So,M+l 

Sio 

Su 

5i2 


0 


0 


Sm-1,M-2 


Sm-i,m 

Sm-i,m+i 

0 


0 

Sm,m-i 

Sm,m 

Sm,m+i 

1 Sm+1,0 

Sm+1,1 

Sm+1,2 


Sm+i,m 

Sm+i,m+i 


and b has the following form correspondingly: 

b = (bo, bi, - ■ ■ , bM+iY■ 

In the matrix F, the block matrices Sij, with 0 < z,j < M are of size 2N x 2N and are 
zero matrices except the ones in tridiagonal stripes The stripe 

0 < i < M are nonzero sparse blocks of size 2N x s(M + 1); furthermore 
0 < i < M are nonzero sparse blocks of size s(M + 1) x 2N and Sm+i,m+i is a 
nonzero tridiagonal matrix of size s{M + 1) x s(M + 1). 

4.2 Space-time Schwarz preconditioners 

The KKT system (IlSp is usually large in size and severely ill-conditioned. In traditional 
reduced space SQP methods, the subsystems corresponding to unknowns C and G are 
time-dependent, time-marching algorithms starting from the initial or terminal moment 
are applied. But we notice that the adjoint equation needs the concentration distribution 
of C at the terminal time t = T, which means that the state equation and the adjoint equa¬ 
tion should be solved in a sequential order. In addition, sequential steps within reduced 
space SQP methods exist between both the KKT subsystems and the time marching for 
time-dependent inverse problems, thus are quite challenging for efficient parallelization. 
To overcome the lack of parallelism in SQP methods, we shall propose to solve the fully 
coupled system (fT^ all at once. This is a very large system, the all-at-once method is 
traditionally regarded as a very expensive approach and not suitable for small computers. 
But on high-performance computers, especially on the upcoming exascale computers, we 
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believe this approach is more attractive than the reduced space methods. It is well known 
that a direct solver such as Gaussian elimination or LU factorization is not suitable for 
very large problems due to the lack of parallel scalability. We shall use a preconditioned 
Krylov subspace method, where some preconditioning technique will be introduced for 
reducing the condition number of the KKT system and accelerating the convergence rate 
of the Krylov subspace method. Various preconditioners have been developed and applied 
for various elliptic and parabolic systems, such as the (block) Jacobi method, (incomplete) 
LU factorization, (multiplicative) additive Schwarz method, multigrid method, multilevel 
method, etc. HlEllg]. Among these preconditioners the Schwarz type domain decom¬ 
position method is shown to have excellent preconditioning effect and parallel scalability 

mm- 

We shall propose a “space-time” Schwarz type preconditioner for the unsteady inverse 
problems. Different from the classical Schwarz type preconditioning technique which only 
decomposes the space domain, we want full parallelization in both space and time. The 
idea of space-time parallel algorithm comes from the parareal algorithm, proposed by 
Lions et al. in [2S]. The parareal algorithm is an iterative method which involves a 
coarse (coarse grid in the time dimension) solver for prediction and a fine (fine grid in 
the time dimension) solver for correction. An insight on the stability and convergence 
of the parareal algorithm was given in |16l I38j . Parareal algorithm has been applied to 
solve problems in molecular dynamics [4], fluid and structural mechanics m, quantum 
control [28] etc. However in the implementation of the parareal algorithm, the scalability is 
determined largely by the coarse time step and the space discretization scheme. In [29], the 
parareal algorithm was combined with domain decomposition in space to achieve higher 
degree of parallelization. Different from the parareal algorithm, the new “space-time” 
Schwarz type preconditioner treats the time variable and the space variables equally, so 
the physical domain is a “space-time” domain, instead of the conventional space domain. 
We apply a domain decomposition technique to the coupled “space-time” domain. 

In each “space-time” subdomain, a time-dependent subproblem with vanishing space 
boundary conditions and vanishing data at “artificial” initial and terminal time is solved. 
The same as the global problem, no time-marching is performed in each subproblem, 
all unknowns associated to the same space-time subdomain are solved simultaneously. 
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Time overlap "Terminal" time 



Figure 1: “Space-time” domain decomposition - an overlapping subdomain with boundary 
conditions. 

The proposed “space-time” Schwarz preconditioner eliminates all sequential steps and all 
unknowns are treated at the same level of priority. We use a right-preconditioned restarted 
GMRES to solve the system (fT^ : 

YM-^U' = b, 

where M~^ is a “space-time” additive Schwarz preconditioner and U = M~^U'. 

To formally define the preconditioner M~^ we need to introduce a partition of the 
space-time domain x [0, T], denoted by 0. Firstly we decompose the domain into 
nonoverlapping subdomains bl,, i = 1, • • • and then divide the time interval [0,T] 
into subintervals Tj = j = 1, 2, • • • , A^ 2 j and 0 = to < < • • • < tN^ = We 

remark that the time partition here is coarser than that used in the full finite element 
discretization described in Section [21 and each interval Tj contains a few consequent time 
intervals 0 consists of Qij = Q, x Tj, i = I,-- - ,Ni,j = 1, • • • ,A^ 2 - In order 

to obtain an overlapping decomposition of 0, we extend each subdomain to a larger 
region fl' and each subinterval Tj to a longer interval Tj, satisfying flj C fl', Tj C Tj. 
Now each 0jj can be straightforwardly extended to 0T = x Tj with Qij C 0b. The 
sizes of 0b are chosen so that the overlap is as uniform as possible around the perimeter 
of interior domains 0b c 0. For boundary subdomains we neglect the part outside of 0. 
See Figured] for an illustration of the space-time domain decomposition. 

We denote the size of the KKT matrix F by iV x IV, clearly there are two degrees of 
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freedom at each mesh point corresponding to the state variable C and the adjoint variable 
G. The unknown time-dependent source intensity variables are allocated on the same 
processor as the last space-time subdomain 0^^ jy^. 

On each extended subdomain 0'■, we define the Nij x N matrix its 2 x 2 block 
element {Rfj)ii,i 2 is either an identity block if the integer indices h and I 2 are related to 
the same mesh point and time step and they belong to 0T or a zero block otherwise. 
The multiplication of Rfj with an iV x 1 vector generates a shorter vector by keeping all 
components corresponding to the subdomain 0T. R^j is defined similarly as Rfj, with the 
difference that its application to a iV x 1 vector excludes the mesh points in 0b\0jj. Now 
for each space-time subdomain we have defined the following local problem: 

— = -V ■ (a(x)VG) - v(x) • VG 

< + S{t-T){C{x,t) - (x,t)€0b (16) 

— = V • (a(x)VG) - V • (v(x)G) + 5(x - x*)/,(t), (x, t) € 0',.. 

< 2 = 1 

It is complemented by the following boundary conditions 

G(x,t) = 0; G(x,t)=0, X € Sn' (17) 

along with the “initial” and “terminal” time boundary conditions 


G(x,tj_i) = 0; 

G(x,tj_i) = 0 

(18) 

G(x,tj) = 0; 

G(x, tj) = 0. 

(19) 


For the last subdomain, we include the additional variables corresponding to the source 
intensities fi, i = 1, ■■ ■ ,s satisfying 

f3y'' + /3ifi + G{^*,-) = 0, ( 20 ) 

with the Neumann condition 

f'{t) = 0, t = 0,T. (21) 

We remark that (US]) is a parabolic system and it is usually “illegal” to impose both 
initial and terminal conditions dUD-dlSD. However, as inexact local solvers on space-time 
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subdomains that form the global preconditioner, such local boundary conditions work well 
as we shall see from our numerical experiments. Similar boundary conditions are used in 
the context of hyperbolic subdomain problems |43j . 

Let Mij be a discretization of (|16ll - (ll9p and be an exact or approximate inverse 
of Mij. The space-time additive Schwarz preconditioner is defined as 

N2 Ni 
j=i i=i 

It is noted that the last space-time subdomain solver is an inverse or an approxi¬ 

mate inverse of the matrix arising from the discretization of the subproblem (I16l) - (ll9p of 
®7Vi N 2 (l20]l - (l2T]l . Although its construction is slightly different from that of the other 
subdomain inverse matrices, we still use the same notation. 

In addition to the standard additive Schwarz method (ASM) described above, the 
restricted version (RAS) of the method developed in [TO] for standard space domain de¬ 
compositions is also widely used. So we extend it to our current space-time domain 
decomposition, then the space-time RAS preconditioner is defined as 

N2 

j=l i=l 

For some applications, RAS achieves better preconditioning effect with less communication 
time since one of the restriction or extension operations does not involve any overlap. We 
use the restricted version in our experiments to be presented in the next section. 

We remark that, computationally, the matrix Mij can be obtained as RfjF{Rfj)'^. 
Moreover, if A ^2 = 1, then no time partition is performed in the time dimension, M^]. is 
a “space-only” domain decomposition preconditioner for the fully coupled KKT system. 

5 Numerical examples 

We present in this section some numerical examples of recovering the intensity functions 
fi{t) (i = 1, • • • , s) at given source locations ,x*. We set the test domain to be 

n = (—2, 2) X (—2, 2), and the terminal time at T = 1. We denote the time step by rit and 
the number of mesh points in x and y directions by and Uy, respectively. Homogeneous 
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Dirichlet and Neumann boundary conditions are imposed on Fi = {x = (xi, X 2 ); \xi \ = 2} 
and r 2 = {x = (xi,X 2 ); |x 2 | = 2}, respectively. The diffusive coefficient a(x) and the 


convective coefficient v(x) are chosen to be 1.0 and (1.0,1.0)^, respectively. 


The preconditioned KKT coupled system will be solved by the restarted GMRES 
method with a restart number 50 m- For clarity, we provide the restarted GMRES 
algorithm below for a general linear system Ax = 6: 

Algorithm 1 Restarted GMRES method with a restart number m 

1. Gompute xq = b — Axq, /3 = ||ro|| 2 , and vi = ro//3; 

2. Generate the Arnoldi basis and the matrix using the Arnoldi algorithm 
starting with vi] 

3. Gompute ym which minimizes ||/3ei — Hmy \\2 and Xm = xq + Vmym', 

4. Stop the iteration if the stopping criteria are satished; otherwise set xq := Xm 
and go to 1. 

In the algorithm above, ei is the first column of the identity matrix, Rn stands for the nxm 
matrix with columns xi, • • • , Vm, and Hm denotes the (m + 1) x m Hessenberg matrix [33]. 
The computational cost is overwhelming and becomes more unstable numerically when m 
is large. So we should set m to a reasonable number and get restarted when the stopping 
criteria are not satisfied. 

For dehniteness, we shall denote as a cell the smallest space-time element after the 
space triangulation of 12 and time partition of the interval [0,T]. The size of overlap, 
that is the number of overlapping cells, is denoted by iovlp and set to 4 unless otherwise 
specified. The subsystem is solved with a sparse LU factorization or an incomplete LU 
factorization (ILU) with the fill-in level denoted by ilulevel. We still take the linear system 
Ax = h for example to explain the definition of the fill-in level of an ILU factorization. 
Based on the Gaussian elimination, each location (i,j) of the sparse matrix A has a level 
of hll, denoted by levjj, which should indicate that the higher the level is, the smaller the 
element is [33] : 

1. The initial value of fill of an element aij of A is dehned by 


levjj = 

otherwise 



f) if ttij ^ 0 or i = j 
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2. Each time in the process of Gaussian elimination, the element aij is updated in 
the loop of k by aij = aij — aikUkj, the corresponding level of fill is updated by 
levij = min{levjj, levjfc + levkj + !}• 

In ILU with fill-in level p, an element whose level of fill levjj does not exceed p will be 
kept. So the larger the level of fill p is, the more elements are kept in the factorization. 

For the scalability test we use LU factorization as the subdomain solver and incomplete 
LU factorization with ilulevel = 3 for the other tests if not specified. The relative residual 
convergence tolerance of GMRES is set to be 10“®. The algorithm is implemented based 
on the package Portable, Extensible Toolkit for Scientific computation (PETSc) [6] and 
run on a Dawning TC3600 blade server system at the National Supercomputing Center 
in Shenzhen, China with a 1.271 PFlops/s peak performance. 

We use a high resolution numerical solution of the concentration at the terminal time 
t = T as the noise-free observation data. In other words, we first solve the forward 
convection-diffusion system dl])-® on a very fine mesh, 640x640, with a small time stepsize 
r = 1/160, then add a random noise of the following form to the terminal concentration 

C%x) = {l + er)C{x,T), 

where r is a random function with uniform distribution in [—1,1], and e is the noise level. 
In our numerical experiments, e is set to 1% if not specified. 

5.1 Reconstruction results and parallel efficiency tests 

The tests are designed to investigate the recovery effect of the pollutant source intensity 
functions and to understand how the solution of the KKT system behaves when using 
different mesh sizes, time steps, regularization parameters and number of processors, which 
is denoted by np. Supposing the source locations are known, we consider the following 


four examples. 



( 1 ) f = 


= (1.0,1.0)^. 

(2) / = ^((1 - t) ( 

'1 V 

--t) +1.0. 

xt = (1.0,1.0f. 

(3) h=t^, 


xj = (1.0,-1.0)^ 

T't 

h = -t{l-t) 


X* = (0.0,0.0)'^. 
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np 

Its 

Time(sec) 

Speedup 

Ideal 

256 

145 

794.70 

1 

1 

512 

181 

377.16 

2.11 

2 

1024 

242 

161.23 

4.93 

4 


Table 1; Scalability test for Example 1: rit = 160, Ux = 160, Uy = 160, DOF = 
8,192,160. 

(4)/i = t2, ^ = (34/79,24/79)^ 

/2 = ^t(l - i) + 1-0, = (14/79,14/79)^ 

h = S-t, x5 = (25/79,15/79)^. 

Example 1. This is an example of recovering a quadratic polynomial source intensity 
function. An regularization is applied and the parameter is chosen to be /32 = 10““^ 
(/3i = 0). Figure [2] shows the reconstructed result with mesh = 80, = 80 and 

time step nt = 320, when 64 processors are used. The blue dotted line represents the 
reconstructed source intensity which is quite close to the red true shape. This shows that 
the time-dependent intensity is successfully recovered by the algorithm. 

We present the strong scalability results in Table [TJ Sparse LU factorization is applied 
as the subdomain solver. The spatial mesh is 160 x 160 and the number of time steps is 
160. The total degrees of freedom is 8,192,160. As the number of processors increases, the 
computing time decreases significantly and superlinear speedup is obtained, for np < 1024, 
in Figure[3l Since the number of processors is the same as the number of subdomains, more 
processors lead to an increasing number of iterations. This suggests that the condition 
number of the preconditioned KKT matrix depends on the number of subdomains. Similar 
dependency was proved for elliptic problems [3U] . 

We fix the number of processors to np = 128, the mesh to = 80, Uy = 80 and the 
time step n* = 320, then test several choices of regularization parameters. ILU factoriza¬ 
tion is used as the subdomain solver with the fill-in level being ilulevel = 3. From the 
results in Table O as (32 becomes smaller, the number of GMRES iterations increases, and 
no significant change is observed for the total computing time. 
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Figure 2: Comparison of analytical and computed solution for Example 1. 




Figure 3: The speedup (left) and computing time (right) for Example 1. 


(32 

Its 

Time(sec) 

10-4 

88 

54.91 

10"® 

93 

55.75 

10-6 

96 

56.68 


Table 2: regularization parameter test for Example 1: nt = 320, = 80, 

80, DOF = 4,096, 320, np = 128. 
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np 

Its 

Time(sec) 

Speedup 

Ideal 

256 

178 

852.23 

1 

1 

512 

184 

379.75 

2.24 

2 

1024 

247 

176.23 

4.84 

4 


Table 3; Scalability test for Example 2: rit = 160, Ux = 160, Uy = 160, DOF = 
8,192,160. 

Example 2. This is an example of recovering a polynomial source intensity function 
of degree 4. We set /3i = 0 and use an regularization with (32 = 10“^. Satisfactory 
result is shown in Figure 0] with mesh = 80, = 80 and time step nt = 160, when 64 

processors are used for the computation. 

Using the same parameter settings as in Example 1, we perform the strong scalability 
test and the results are given in Table [3] and Figure [5l Superlinear speedup is obtained 
when np < 1024. Next we test three sets of mesh and time step size in Table 01 The 
regularization parameter is set to be (32 = 10“®, and 64 processors are used. The overlap 
iovlp = 4 and the fill-in level of ILU ilulevel = 3. We observe from Table 0] that as the 
mesh and the time step size become finer, the number of GMRES iterations grows slightly, 
and the computing time increases with the problem size. 

Now we investigate the performance of the space-time Schwarz preconditioner. An 
important feature of the proposed space-time Schwarz preconditioner lies in the paral¬ 
lelization in the time dimension. If the time range is not partitioned as mentioned in the 
end of Section 14.21 the preconditioner also works from the result in Figure El but it is 
observed from Table 01 under the same settings, that the “space-only” Schwarz precon¬ 
ditioner costs more iterations and computing time compared to the space-time Schwarz 
preconditioner. Thus the Schwarz preconditioner with a partition in the time is more 
efficient than the “space-only” domain decomposition preconditioner. In the end of this 
example, we perform the noise level test and the results are given in Figure [71 The re¬ 
sults agree with our expectation that the reconstruction accuracy deteriorates with the 
increasing level of noise in the measurement data. 
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2.6 



t 


Figure 4: Comparison of analytical and computed solution for Example 2. 




Figure 5: The speedup (left) and computing time (right) for Example 2. 


nt 

Ux X ny 

DOF 

Its 

Time(sec) 

100 

40 X 40 

320 100 

40 

8.44 

240 

64 X 64 

1 966 320 

46 

33.25 

320 

80 X 80 

4 096 320 

49 

69.28 


Table 4: Mesh size and time step size test for Example 2: /32 = 10 np = 64. 


Preconditioner 

np 

Its 

Time(sec) 

np 

Its 

Time(sec) 

space-only 

64 

49 

114.34 

256 

129 

230.29 

space-time 

64 

39 

37.60 

256 

83 

156.64 


Table 5: Preconditioner comparison for Example 2: /32 = 10 nt = 100, = 40, Uy 

40, DOF = 320,100 for np = 64; /32 = 10-^ n* = 320, = 80, Uy = 80, DOF 

4, 096,320 for np = 256. 
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Figure 6: Space-only ASM preconditioner (left) vs. space-time ASM preconditioner (right) 
for Example 2. 





Figure 7: Noise level test for example 2. Top left: e = 1%, top right: e = 5%, bottom: 
e = 10%. nt = 240, = 64, Uy = 64, DOF = 1, 966,320, /32 = 10■^ np = 64. 
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Example 3. In this example, we test the recovery of two source intensities. Compared 
with the single source intensity cases, more regularization parameters are needed. Here 
we test the following two sets of regularization parameters with np = 64: 

(1) = 0,/3f = for the source intensity function fi and = 10“® for 

f 2 - The mesh is Ux = 80, Uy = 80 and the time step is nt = 80; 

(2) = 0,/3f = 10“® for /i and I5\ = 0,/3| = 10~® for / 2 . The mesh and the time step 
are set to be Ux = 64, Uy = 64 and nt = 256. 

The numerical results are shown in Figure [HI The computed fi matches with its original 
data perfectly, but the computed /2 is less accurate. As we see in the tests for Example 
1 and Example 2, /2 is physically harder to recover than the simpler function fi. Overall 
the reconstruction effect for both source intensities are reasonable. 

Next we show the strong scalability results in Figure [9| and Table [6l We still observe a 
superlinear speedup, although it is a bit worse than that of Examples 1 and 2. It implies 
that it is more difficult to separate and identify multiple source intensities than the single 
source case. And from our previous experiments with reduced space SQP methods, the 
recovery of multiple sources is much more difficult to converge than that of the single 
source case. 

Lastly, we test the algorithm with parameters such as the fill-in level of ILU factoriza¬ 
tion and the size of overlap in Table |7| and Table [HI respectively. 

It is observed that the number of iterations decreases with the increase of the overlap¬ 
ping size or the fill-in level, however, it costs more communication time when we increase 
the overlap between “space-time” subdomains, and more computing time is used in the 
preconditioning stage when we raise the fill-in level of the ILU factorization. 

Example 4. We now test the numerical reconstruction for three point sources, and 
observe how the speedup changes with increasing number of sources. 

For this test, we take the spatial mesh 160 x 160 and the number of time steps 160. 
Regularization parameters are respectively set to be I5\ = Q,I3\ = 10“^, = 10““^,/3| = 

10“® and fii = 10“^, = 8 x 10~®. From Figure [TOl we see that, apart from the initial 

part of the third intensity which is not quite close to the true values, the rest are recovered 
satisfactorily. Now we use a 160 x 160 space mesh and 160 time step to test the strong 
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Figure 8: Recovery of two source intensities using two sets of regularization parameters. 
Left: P\ = 0,/3? = 10“'^,= 10"®,= 10“®, right: = 0,/3? = 10"®,/3^ = 0,/3f = 

10 -®. 


up 

Its 

Time(sec) 

Speedup 

Ideal 

256 

148 

765.80 

1 

1 

512 

150 

360.88 

2.12 

2 

1024 

211 

178.37 

4.29 

4 


Table 6: Scalability test for Example 3 with two point sources: n* = 160, Ux = 160, ny = 
160, HOF = 8,192, 320. 




Figure 9: The speedup (left) and the computing time (right) for Example 3. 
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ilulevel 

Its 

Time(sec) 

1 

496 

75.98 

2 

295 

85.12 

3 

247 

116.11 


Table 7: Fill-in level of ILU test for Example 3: = 10 /3| = 10 n* = 400, n^, 

80, Uy = 80, DOF = 5,120,400, np = 128. 


iovlp 

Its 

Time(sec) 

1 

- 

- 

2 

432 

114.22 

4 

247 

121.26 

6 

238 

400.12 


Table 8: Overlap test for Example 3: = 10 = 10 nt = 400, Ux = 80, Uy 

80, HOF = 5,120,400, np = 128. 



Eigure 10: The reconstruction for three point sources. 
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np 

Its 

Time(sec) 

Speedup 

Ideal 

256 

148 

794.19 

1 

1 

512 

150 

383.56 

2.07 

2 

1024 

211 

194.14 

4.09 

4 


Table 9: Scalability test for Example 4 with three point sources: n* = 160, = 160, Uy = 

160, DOF = 8,192,480. 




Figure 11: The speedup (left) and the computing time (right) for Example 4. 

scalability and compute time in Figure [11] and Table (9] LU factorization is used as the 
subdomain solver. It is observed that the speedup for three point sources is almost linear, 
still satisfactory but a bit worse than Examples 1,2 and 3. As a conclusion the speedup 
deteriorates slowly with the number of unknown point sources. 

5.2 Comparisons with two reduced space SQP methods 

A reduced space method for reconstruction of the location and intensity of a single point 
pollutant source was developed in m- With the source location known in our current 
case, the process of reconstructing the source intensity described in m can be stated as 
follows: 
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Algorithm 2 Nonlinear CG method 

Select the initial guesses /°, and set fe := 0. 

For k — 1 , 2 , • • • , Nrnax 

Solve the state system (the first equation in (fT3l) i for {C'^(/^)}; 

Solve the adjoint system (the second equation in (fT3]) i for {G^(/^)}; 

Apply the nonlinear CG method to update /^: + oi\d^] 

Stop the iteration if the stopping criteria are satisfied; otherwise set k := k + 1. 


We use the Fletcher-Reeves (FR) formula to update the nonlinear CG direction d^: 
d^ = J'k + lkd^~^, with dP = Jq and 7 ^ = ||-^fc|P/||-^fc_ilP and J(, = -{Jh)'{f^) being the 
negative gradient direction which is obtained from formula ([9]). That is, 


{M\t) - Gi{^\tM{t)dt + /32 j\f^)'{tW)'{t)dt 


f — 

We select the stepsize such that aj = argmin^^QJ^(/^ + ydfc)- For the and 
regularizations in (l5|), we can work out the exact formulae: 


Oi = — 


= 


(Af,Af) + /3i(#,#) 
(cf(/")-c%Af) + /32((/^y,(d"y 
(Af,Af)+/32((#y,(#y) 


(L^ regularization), 


regularization). 


where Aff = Cff {f^)'dP is obtained by solving the following sensitivity equation, 
{drAl^Vh) + {aVAl.Vvh) + (V ■ {^Al),Vh) = Ufe(x*)(#)", 'dvh^V^, n = 1, - • • ,M, 


with A° = 0. At each iteration, three time-dependent subsystems are solved. When we 
implement this nonlinear CG method on parallel computers, we need to develop a parallel 
solver for each subsystem. We will test two cases: the first one uses the space domain 
decomposition preconditioner but keeps the time marching process, while the second one 
uses a space-time domain decomposition preconditioner as it is developed for the fully 
coupled system in this work and solves each time-dependent subsystem all-at-once. These 
two parallel solvers are denoted by RS(1) and RS(2) respectively. We shall compare the 
computing times between our proposed space-time preconditioning method, denoted by 
FS, and the two reduced space SQP methods RS(1) and RS(2); see Table fTOl We use the 
three aforementioned methods to implement Example 2 with four sets of meshes, and the 
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np 

nt 

rix X Uy 

Solver 

Time(sec) 

64 

40 

40 X 40 

FS 

12.064 




RS(1) 

418.580 




RS(2) 

125.484 

128 

80 

80 X 80 

FS 

15.525 




RS(1) 

682.794 




RS(2) 

99.528 

256 

160 

80 X 80 

FS 

23.736 




RS(1) 

994.962 




RS(2) 

200.543 

512 

320 

160 X 160 

FS 

136.717 




RS(1) 

7240.881 




RS(2) 

1094.886 


Table 10: The computing time of the proposed full space method FS, the reduced space 
method RS(1) and RS(2). 

number of processors increases with the refinement of the meshes. The subdomain solvers 
for all three kinds of methods are ILU. We firstly compute the result by the FS method 
with zero initial guess and record the error accuracy e = ||/ — /*||. Then we use the same 
initial guess and the error bound e for the reduced space methods RS(1) and RS(2), and 
set the stopping criterium as ||/^ ~ /*|| < e. In this way we can compare the computing 
time for all these methods. 

As shown in Table fTOl the computing time of the FS method is much less than the 
ones of RS(1) and RS(2). For the two reduced space methods, RS(2) using the space-time 
domain decomposition solver is faster than RS(1) keeping the time marching process and 
using a space domain decomposition solver. We can see that the space-time fully coupled 
preconditioner is much better for parallellization, and the all-at-once method for the fully 
coupled KKT system is always more efficient than the reduced space iterative optimization 
method on parallel systems. 
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6 Concluding remarks 


We developed a new space-time domain decomposition method for unsteady source inver¬ 
sion problems. The main ingredient of our algorithm includes solving the fully coupled 
KKT system by GMRES iteration with a space-time additive Schwarz preconditioner. 
Although the size of the linear system is significantly increased compared to the reduced 
space SQP methods, the one-shot method avoids the sequential step between the state 
equation and the adjoint equation, as well as the time-marching process in the time di¬ 
mension, and thus achieves higher degree of parallelism. This is well confirmed by the 
numerical results shown in the last section. Another advantage of the new method is 
that the recovery of multiple sources is obtained using the same algorithmic and software 
framework as the single source case, and the framework is easily extended to recover other 
kinds of source intensities. 

We have observed from the numerical examples that the new space-time additive 
Schwarz method is quite robust also with respect to the noise in the observation data. 
It is important to note that the new space-time method is highly parallel and scalable, 
and extensible naturally to three-dimensional problems. 
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